Code generation method, code generating apparatus and computer readable storage medium

ABSTRACT

A code is generated for mapping source to target code words which allows encoding source data at reduced probability of incorrect decoding, e.g. for DNA storage. The target code words are grouped into subsets and comprise identifying and remaining portions. The identifying portions of target code words corresponding to a same subset are identical. A first code symbol set of source code words is selected for addressing the subsets. For the subsets, neighboring subsets are determined. The identifying portions of the target code words of neighboring subsets differ from those of the corresponding subset by up to a predetermined amount of symbols. Source code words are assigned where the corresponding first code symbols address the same subset to said subset such that an amount of target code words of said subset having their remaining portions identical to their neighboring subsets corresponds to an optimization criterion.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of U.S. Non-Provisional patent application Ser. No. 15/502,528, filed Feb. 8, 2017, which itself claims benefit, under 35 U.S.C. §365 of International Application PCT/EP2015/067654 filed Jul. 31, 2015, which was published in accordance with PCT Article 21(2) on Feb. 11, 2016, in English, and which claims the benefit of European Patent Application No. 14306259.4 filed Aug. 8, 2014, all of which are incorporated by reference herein in their respective entireties.

FIELD

A code generation method and apparatus are presented. In particular, the present disclosure relates to a method and an apparatus for mapping source code words to target code words, for example suitable for encoding of information for storage in synthetic nucleic acid strands, and to a corresponding computer readable storage medium.

BACKGROUND

A nucleic acid is a polymeric macromolecule and consists of a sequence of monomers known as nucleotides. Each nucleotide consists of a sugar component, a phosphate group and a nitrogenous base or nucleobase. Nucleic acid molecules where the sugar component of the nucleotides is deoxyribose are DNA (deoxyribonucleic acid) molecules, whereas nucleic acid molecules where the sugar component of the nucleotides is ribose are referred to as RNA (ribonucleic acid) molecules. DNA and RNA are biopolymers appearing in living organisms.

Nucleic acid molecules are assembled as chains or strands of nucleotides. Nucleic acid molecules can be generated artificially and their chain structure can be used for encoding any kind of user data. For storing data in synthesized, i.e. artificially created, DNA or RNA, usually short DNA or RNA fragments (oligonucleotides, short: oligos) are generated. With these nucleic acid fragments, a data storage system can be realized wherein data are stored in nucleic acid molecules. The synthesized nucleic acid molecules carry the information encoded by the succession of the different nucleotides forming the nucleic acid molecules. Each of the synthesized nucleic acid molecules consists of a sequence or chain of nucleotides generated by a bio-chemical process using a synthesizer and represents an oligo or nucleic acid fragment wherein the sequence or cascade of the nucleotides encodes a code word sequence corresponding to a set of information units, e.g., sets of information bits of user data. For example, in a DNA storage system, short DNA fragments are generated. These molecules can be stored and the information can be retrieved from the stored molecules by reading the sequence of nucleotides using a sequencer.

Sequencing is a process of determining the order of nucleotides within the particular nucleic acid fragment. Sequencing can be interpreted as a read process. The read out order of nucleotides is processed or decoded to recover the original information stored in the nucleic acid fragment.

In this context, the terms “nucleic acid fragment”, “oligonucleotide” and “oligo” are used interchangeably and refer to a short nucleic acid strand. The term “short” in this context is to be understood as short in comparison to a length of natural DNA which encodes genetic instructions used by living organisms and which may consist of millions of nucleotides. Synthesized oligos may contain more than one, for example more than one hundred, e.g. between 100 and 300, or several thousands of nucleotides.

This technology enables a provision of data storage systems wherein a write process is based on the creation of nucleic acid fragments as sequences of nucleotides which encode information to be stored.

The generated nucleic acid fragments are stored, for example as solid matter or dissolved in a liquid, in a nucleic acid storage container. The characteristics of the nucleic acid storage may depend on the amount of stored data and an expected time before a readout of the data will take place.

Digital information storage in synthesized DNA or RNA may provide a high-capacity, low-maintenance information storage.

DNA storage has been investigated in “Next-generation digital information storage”, Church et al., Science 337, 1628, 2012, and in “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”, Goldman et al., Nature, vol. 494, 2013.

The data can be any kind of sequential digital source data to be stored, e.g., sequences of binary or quaternary code symbols, corresponding to digitally, for example binary, encoded information, such as textual, image, audio or video data. Due to the limited oligo length, the data is usually distributed to a plurality of oligos.

In such a nucleic acid storage system the oligos are subject to several processing stages: The oligos are synthesized, i.e. nucleic acid strands to be stored are created, amplified, i.e., the number of each single oligo is increased, e.g., to several hundreds or thousands, and sequenced, i.e., the sequence of nucleotides for each oligo is analyzed. These processing stages can be subject to errors, resulting in non-decodable or incorrectly decoded information.

DNA strands consist of four different nucleotides identified by their respective nucleobases or nitrogenous bases, namely, Adenine, Thymine, Cytosine and Guanine, which are denoted shortly as A, T, C and G, respectively. RNA strands also consist of four different nucleotides identified by their respective nucleobases, namely, Adenine, Uracil, Cytosine and Guanine, which are denoted shortly as A, U, C and G, respectively.

The information is stored in sequences of the nucleotides. Regarded as an information transmission system, such mapping from information bits to different nucleotides can be interpreted as modulation with A, T, C, G as modulation symbols (or A, U, C and G, respectively), where the symbol alphabet size is 4. Reversely, the decision rule from a given symbol tuple or target code word to an information bit tuple or source code word can be referred to as demodulation.

Nucleobases tend to connect to their complementary counterparts via hydrogen bonds. For example, natural DNA usually shows a double helix structure, where A of one strand is connected to T of the other strand, and, similarly, C tends to connect to G. In this context, A and T, as well as C and G, are called complementary. Correspondingly, A with U and G with C form pairs of complementary RNA bases.

Two sequences of nucleotides are considered “reverse complementary” to each other, if an antiparallel alignment of the nucleotide sequences results in the nucleobases at each position being complementary to their counterparts. Reverse complementarity does not only occur between separate strands of DNA or RNA. It is also possible for a sequence of nucleotides to have internal or self-reverse complementarity. As an example, a DNA fragment is considered self-reverse complementary, if the fragment is identical to itself after the complementing and order-reversing steps. For example, a DNA fragment AATCTAGATT is self-reverse complementary: original DNA fragment: AATCTAGATT; complement: TTAGATCTAA; order reversed: AATCTAGATT.
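The complementing and order-reversing steps described above can be sketched in a few lines of Python. This is only an illustration of the definition; the function names `reverse_complement` and `is_self_reverse_complementary` are chosen here and do not appear in the original disclosure.

```python
# Complement table for DNA nucleobases (A pairs with T, C pairs with G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(fragment: str) -> str:
    """Return the complement of the fragment, read in reverse order."""
    return "".join(COMPLEMENT[base] for base in reversed(fragment))


def is_self_reverse_complementary(fragment: str) -> bool:
    """True if the fragment equals its own reverse complement."""
    return fragment == reverse_complement(fragment)


print(reverse_complement("AATCTAGATT"))             # AATCTAGATT
print(is_self_reverse_complementary("AATCTAGATT"))  # True
```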

Long self-reverse complementary fragments may not be readily sequenced which hinders correct decoding of the information encoded in the strand.

Further, tests have shown that nucleotide run lengths, i.e. cascades or sequences of identical nucleotides, may reduce sequencing accuracy if the run length exceeds a certain length.

Furthermore, as the amplification process and the sequencing introduce errors in the oligos at different locations, many sequenced oligos may not contain the correct information.

Therefore, a specific modulation coding should be used that allows encoding of information or source data at a high coding efficiency while having a reduced probability of incorrect decoding.

SUMMARY

According to an aspect of the invention, a code generation method for mapping a plurality of source code words to a plurality of target code words comprises

grouping the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;

selecting a first set of code symbols of the source code words for addressing the plurality of subsets;

determining for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and

assigning source code words where the corresponding first set of code symbols addresses the same subset, to said subset such that an amount of the target code words of said subset having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to an optimization criterion.

Accordingly, a code generating apparatus for mapping a plurality of source code words to a plurality of target code words comprises

a code word grouping unit configured to group the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;

a selection unit connected to the code word grouping unit and configured to select a first set of code symbols of the source code words for addressing the plurality of subsets;

a determining unit connected to the code word grouping unit and configured to determine for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and

a mapping unit connected to the selection unit and the determining unit and configured to assign source code words where the corresponding first set of code symbols addresses the same subset, to said subset such that an amount of the target code words of said subset having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to an optimization criterion.

Further, a computer readable storage medium has stored therein instructions enabling mapping a plurality of source code words to a plurality of target code words, which, when executed by a computer, cause the computer to:

-   group the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;
-   select a first set of code symbols of the source code words for addressing the plurality of subsets;
-   determine for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and
-   assign source code words where the corresponding first set of code symbols addresses the same subset, to said subset such that an amount of the target code words of said subset having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to an optimization criterion.

The computer readable storage medium has stored therein instructions which, when executed by a computer, cause the computer to perform steps of the described method.

The source code words have a first predefined length, i.e. consist of a first predefined amount of code symbols. The target code words have a second predefined length, i.e. consist of a second predefined amount of code symbols.

In an embodiment the target code words comprise sequences of quaternary code symbols. The source code words may comprise sequences of binary code symbols. The usage of quaternary code symbols for target code words allows a direct correspondence or mapping of used symbols to DNA or RNA nucleotides or nucleobases and enables a more efficient coding than, for example, a mapping of binary symbols 0 and 1 to two respective ones of the four different nucleotides.

A neighboring subset possesses a nonzero Hamming distance to the corresponding subset. As an example, the predetermined amount of code symbols can be equal to 1, i.e. code words of neighboring subsets differ from the corresponding subset by one symbol within the identifying portion. The neighboring subsets are determined for each subset of the plurality of subsets.

In an embodiment the term “corresponds to an optimization criterion”, i.e. satisfies an optimization criterion, refers to a feature that the amount of the target code words of said subset having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets is maximized. A maximized amount of target code words refers to the maximum possible amount of target code words of a subset, having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subset. In another embodiment the term “corresponds to an optimization criterion” refers to the feature that said amount of target code words is adapted to be a number close to but below the maximum possible amount, e.g. 1 below the maximum possible amount.

The term “portion” of a code word does not necessarily imply that the code symbols belonging to that portion form a sequence of consecutive symbols within the code word. For example, the remaining portion may embed the identifying portion or vice versa, code symbols at several defined positions may belong to the identifying portion, while the remaining symbols belong to the remaining portion etc.

The solution according to the aspect of the invention provides a code book generation scheme to be used for generating code word sequences suitable for synthesizing nucleic acid molecules containing corresponding sequences of nucleotides. The encoding of source code words carrying data or information units is done by concatenating corresponding target code words to generate code word sequences suitable for synthesizing oligos. The coding scheme is applicable to arranging information units suitably to be stored in nucleic acid fragments while being decodable at a reduced error probability.

The provided solution at least has the effect that the target code words being subject to single or up to the predetermined amount of symbol errors within the identifying portion will be decoded correctly. Hence, information encoded in nucleic acid strands or oligos synthesized using sequences of the created target code words being subject to distortion will have an increased probability of correct decoding. The reliability of the sequencing of the oligos is improved, allowing provision of a reliable system for storing information in nucleic acid molecules, for example for archiving purposes.

In one embodiment target code words are removed from the plurality of target code words according to a decoding related criterion before grouping the plurality of target code words into a plurality of subsets of the target code words. Here, the term “according to a decoding related criterion” refers to a dependency of the decoding or decoding accuracy on the structure of the target code words to be decoded, i.e. on the actual sequence of consecutive symbols within a target code word or a sequence of target code words. For example, if the target code words serve as a basis for storing data in synthesized oligos, the performance accuracy of the bio-chemical processes of synthesizing, amplifying and sequencing may differ depending on the particular sequence of nucleotides within an oligo generated or to be generated, respectively. Other parameters may influence performance accuracy as well, for example a presence of other molecules or physical parameters such as, for example, temperature, pressure etc. In the described embodiment potential target code words which exhibit a higher probability of causing decoding errors are removed for increased probability of correct decoding.

A decoding related criterion may, for example, be a run length of code symbols, i.e. the number of consecutive identical code symbols within a target code word or a sequence of target code words or, respectively, consecutive identical nucleotides within an oligo or a sequence of oligos. For example, the run lengths for an oligo AATTTGCC are 2, 3, 1, 2 for A, T, G, C, respectively.
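The run length definition above can be illustrated with a short Python sketch. The helper name `run_lengths` is an assumption made for this example only.

```python
from itertools import groupby


def run_lengths(sequence: str) -> list[tuple[str, int]]:
    """Return (symbol, run length) for each run of identical consecutive symbols."""
    return [(symbol, len(list(run))) for symbol, run in groupby(sequence)]


print(run_lengths("AATTTGCC"))  # [('A', 2), ('T', 3), ('G', 1), ('C', 2)]
```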

As an example, according to the decoding related criterion, target code words that comprise a run length of identical code symbols of more than a predefined maximum run length are removed. This reduces a probability of decoding errors caused by run length problems. For example, the predefined maximum run length can be 3, as experimental results have shown that 4 or more nucleotide repetitions, such as “AAAA” or “TTTTT”, should be avoided in order to achieve more reliable sequencing results.

Further, target code words that comprise a run length of identical code symbols of more than the predefined maximum run length when being concatenated with another of the target code words are removed. This avoids excessive run lengths of identical target code symbols occurring when sequences of two or more code words are concatenated, for example in order to create a code word sequence suitable to generate a synthesized oligo from. Thereby, the probability of decoding errors caused by run length problems is further reduced.
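One way to picture both removal rules is the following Python sketch. It is not the procedure of the disclosure: the concatenation rule is approximated here by a sufficient condition that bounds the leading and trailing runs of each word (`max_lead + max_trail <= max_run`), and the word length of 4 as well as all names and default limits are assumptions made for illustration.

```python
from itertools import groupby, product


def runs(word: str) -> list[int]:
    """Lengths of the runs of identical consecutive symbols in word."""
    return [len(list(group)) for _, group in groupby(word)]


def keep_word(word: str, max_run: int = 3, max_lead: int = 1, max_trail: int = 2) -> bool:
    """Keep a word if its own runs are short enough and its first and last runs
    are bounded so that joining any two kept words cannot exceed max_run.
    Assumes words of at least two symbols and max_lead + max_trail <= max_run."""
    lengths = runs(word)
    return (
        max(lengths) <= max_run
        and lengths[0] <= max_lead
        and lengths[-1] <= max_trail
    )


# All quaternary words of (hypothetical) length 4, then the filtered code book.
all_words = ["".join(symbols) for symbols in product("ACGT", repeat=4)]
kept = [word for word in all_words if keep_word(word)]
print(len(all_words), len(kept))
```

The bound on leading and trailing runs is stricter than necessary, so it removes more words than a pairwise check would; it is used here only because it makes the concatenation property easy to verify.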

The removal of target code words in view of run length constraints increases suitability of code word sequences generated from the (remaining) target code words for synthesizing a corresponding oligo, as longer run lengths, e.g. exceeding 3, in synthesized oligos or nucleic acid fragments can be less suitable for correct sequencing.

Without the removal of target code words, e.g. due to the run length constraint, each symbol of a target code word can represent two information bits or binary symbols of a source code word. A possible coding taking into account run length constraints can be based on assigning two different target code symbols to each source code symbol. For example, for source code symbols “0” and “1” and target code symbols “A”, “T”, “G” and “C”, assigning “A” and “C” to “0”, and “G” and “T” to “1”, and replacing a target code symbol by its counterpart in case a run length of target code symbols exceeds the allowed predefined maximum run length can be used to avoid run lengths exceeding the predefined maximum run length. However, here each target symbol can only represent one source code symbol.
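The simple one-bit-per-symbol coding described above can be sketched as follows. Which of the two symbols is used by default and which serves as the counterpart is an assumption of this example (here “A” and “G” are the defaults, “C” and “T” the counterparts), as are the function names.

```python
# Default and counterpart target symbols per source bit (assumed choice).
PRIMARY = {"0": "A", "1": "G"}
ALTERNATE = {"0": "C", "1": "T"}
DECODE = {"A": "0", "C": "0", "G": "1", "T": "1"}


def encode(bits: str, max_run: int = 3) -> str:
    """Map each bit to one nucleotide, switching to the counterpart symbol
    whenever the default symbol would extend a run beyond max_run."""
    out: list[str] = []
    run = 0  # length of the run currently ending the output
    for bit in bits:
        symbol = PRIMARY[bit]
        if out and out[-1] == symbol and run == max_run:
            symbol = ALTERNATE[bit]
        run = run + 1 if out and out[-1] == symbol else 1
        out.append(symbol)
    return "".join(out)


def decode(symbols: str) -> str:
    """Both symbols assigned to a bit decode to that bit."""
    return "".join(DECODE[s] for s in symbols)


print(encode("000000"))          # AAACAA
print(decode(encode("000000")))  # 000000
```

As the text notes, this scheme carries only one bit per nucleotide, which motivates the more efficient code construction.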

According to the embodiment described above, even under the run length constraint, the capacity for run length constrained sequences is higher than 1. In this context, “capacity” refers to how many bits of a source code word can be represented by one symbol of a target code word asymptotically. The capacity C of an M-level run length limited modulation code where run lengths after modulation are limited in the range [d, k], where M is the alphabet size of modulation and d and k denote the minimum and maximum run length, respectively, is given by C = log₂ γ, where γ is the largest real root of the following characteristic equation: z^(k+1) − z^(k) − (M−1)z^(k−d+1) + (M−1) = 0. Accordingly, the run length constraint of avoiding run lengths exceeding 3 on the modulation for DNA storage can be interpreted as to design a quaternary, run length limited code subject to d=1 and k=3. The corresponding capacity can be determined as C≈1.9957 bits/symbol, i.e., each symbol (nucleotide) can asymptotically represent 1.9957 information bits. In the context of data storage, a modulation with high modulation efficiency R/C, with code rate R bits/symbol, is desired, as the storage density increases with the modulation efficiency.

In one embodiment the determining step, i.e. the determining for the subsets one or more corresponding neighboring subsets within the plurality of subsets of the target code words, comprises or is carried out such that the identifying portions of the one or more neighboring subsets differ from the corresponding subset by selected symbol flips corresponding to dominant sequencing errors based on a sequencing error probability of nucleotides within nucleic acid strands. The amount of neighboring subsets for a specific subset is limited by only taking into account dominant symbol errors for the flipping. This additional constraint causes the neighboring subsets to be selected such that precisely for the particular subset/neighboring subset pairs the amount of common assignments is maximized, i.e. the amount of target code words is maximized which differ between the subset and its neighboring subset only by up to said predefined amount of code symbols, e.g. one code symbol, within the identifying portion. When using the generated target code words for synthesizing nucleic acid strands, such as DNA strands, certain symbol flips where a symbol is decoded that differs from the initially encoded symbol, can be dominant, i.e. occur more likely than others. For example, the dominant single symbol errors in DNA storage are the symbol transitions between A and G, and between C and T. By maximizing the amount of common assignments between two neighboring subsets the influence on the decodability of the source code words, more precisely on the first set of source code symbols for assigning source code words to subsets of target code words, due to dominant single symbol errors within the identifying portion of the target code symbols is minimized or at least reduced. This significantly reduces the remaining error rate.
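Restricting neighbors to dominant single symbol flips can be sketched as below. The flip table follows the transitions named in the text (A and G, C and T); the function name `dominant_neighbors` and the two-symbol identifier in the usage line are assumptions for illustration. In a full implementation the result would additionally be intersected with the identifiers of subsets that actually exist.

```python
# Dominant single symbol errors named in the text: A <-> G and C <-> T.
DOMINANT_FLIPS = {"A": "G", "G": "A", "C": "T", "T": "C"}


def dominant_neighbors(identifier: str) -> list[str]:
    """Identifying portions reachable from identifier by exactly one dominant flip."""
    return [
        identifier[:i] + DOMINANT_FLIPS[symbol] + identifier[i + 1:]
        for i, symbol in enumerate(identifier)
    ]


print(dominant_neighbors("AC"))  # ['GC', 'AT']
```

With this restriction an identifying portion of n symbols has at most n neighbors, instead of up to 3n when every single symbol substitution is considered.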

In one embodiment the pluralities of source code words and target code words are divided into source code words and target code words of a first code and of a second code, the target code words of the first code and of the second code both having the properties that the reverse complementary word of a target code word of the corresponding code still belongs to the corresponding code, and that there is no common code word between the first code and the second code, and that a target code word of the second code is neither equal to any portion of two cascaded target code words of the first code nor equal to any portion of a cascade of one target code word of the first code and one target code word of the second code, and wherein the grouping, selecting, determining and assigning is applied to the first code. In another embodiment the second code instead of or in addition to the first code may be subject to the grouping, selecting, determining and assigning. In order to avoid self-reverse complementarity and, thereby, increase correctness of decoding, code word sequences are generated by multiplexing code words of the first and the second code. This allows, for example, generation of non-self-reverse complementary nucleic acid oligos to be synthesized being composed of multiplexed code words from the first and the second code.

If the first code is generated according to the embodiment described above, the second code may serve as provider of suitable delimiting code words to avoid self-reverse complementarity. In one embodiment, for increased coding efficiency by employing the second code for additional information transmission at reduced error probability, the used second code can be generated according to the following: The plurality of target code words of the second code is grouped into a plurality of subsets of the target code words of the second code, the target code words of the second code comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words of the second code which correspond to a same subset of the plurality of subsets of target code words of the second code are identical. A first set of code symbols of the source code words of the second code is selected for addressing the plurality of subsets of target code words of the second code. Then source code words of the second code where the corresponding first set of code symbols addresses the same subset of target code words of the second code are assigned to said subset according to a cost function minimizing a Hamming distance between the remaining portions of the target code words of the second code.

For example, the identifying portions of the target code words of the second code can be embedded between two parts of the corresponding remaining portions. Further, the source code words may, for example, be binary code words of a first predefined length and the target code words may, for example, be quaternary code words of a second predefined length.

As an example, the cost function minimizing the Hamming distance between the remaining portions of the target code words of the second code may depend on a symbol error probability. According to this example embodiment, the cost function does not treat each possible error equally, but takes into account that, depending on the application, certain symbol errors may occur more likely than others. This allows adaptation of the coding scheme to specific error constraints of the targeted application.

As an example, the symbol error probability is based on a sequencing error probability of nucleotides within nucleic acid strands. This allows adaptation of the coding scheme to the specific constraints of nucleic acid storage systems such as DNA or RNA storage systems.

In one embodiment, at least one code word sequence from one or more of the target code words is generated; and at least one nucleic acid molecule comprising a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence is synthesized. A nucleic acid molecule may, for example, be a DNA fragment or an RNA fragment generated by a synthesizer device which receives sequences of the generated code words. In other words, DNA or RNA oligos are synthesized according to sequences of the generated code words. The synthesized oligos carry the information encoded by the succession of the nucleotides forming the oligos. These molecules can be stored and the information can be retrieved by reading the sequence of nucleotides using a sequencer and decoding the extracted code words.

For example, for the embodiment making use of two different codes, oligos are synthesized from at least one code word sequence which is generated from one or more of the target code words, wherein after a predefined amount of first code words at least one second code word is inserted. The oligo contains a segment wherein a sequence of nucleotides is arranged to correspond to the code word sequence. Many more than one nucleic acid molecule may be generated.
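The insertion of a second code word after a predefined amount of first code words can be pictured with the following sketch. The function name `multiplex`, the parameter `period` and the example words are placeholders chosen for illustration; the example words are not taken from the codes of the disclosure and do not necessarily satisfy their properties.

```python
def multiplex(first_words: list[str], second_words: list[str], period: int) -> list[str]:
    """Insert one second code word after every `period` first code words."""
    if period < 1:
        raise ValueError("period must be at least 1")
    sequence: list[str] = []
    second = iter(second_words)
    for count, word in enumerate(first_words, start=1):
        sequence.append(word)
        if count % period == 0:
            try:
                sequence.append(next(second))
            except StopIteration:
                raise ValueError("not enough second code words") from None
    return sequence


# Placeholder words only.
words = multiplex(["ACGT", "CATG", "GTCA", "TGAC"], ["AGGA", "TCCT"], period=2)
print("".join(words))  # ACGTCATGAGGAGTCATGACTCCT
```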

The amount of nucleic acid molecules or oligos generated or synthesized by a synthesizer corresponds to the amount of generated code word sequences. At least one nucleic acid molecule is synthesized for each code word sequence. However, multiple oligos may be generated for each or a selected, for example high-priority, subset of the code word sequences. The synthesizing step may, for example, be carried out after generation of all code word sequences or after generation of each of the sequences.

Further, in an embodiment the apparatus or device which is configured to carry out the method described above is comprised in a nucleic acid storage system, such as a DNA storage system or an RNA storage system. For example, the nucleic acid storage system further comprises a nucleic acid storage unit or container and a sequencer unit or device configured to sequence the synthesized and stored nucleic acid molecules to retrieve and decode the encoded code word sequence.

While not explicitly described, the present embodiments may be employed in any combination or sub-combination.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically illustrates a code generation method for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention;

FIG. 2 schematically illustrates steps of a code generation method for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention;

FIG. 3 schematically illustrates a code generation method for mapping a plurality of source code words to a plurality of target code words according to another embodiment of the invention;

FIG. 4 schematically illustrates an example of a neighboring subset graph; and

FIG. 5 schematically illustrates a code generating apparatus for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

For a better understanding, the invention will now be explained in more detail in the following description with reference to the drawings. It is understood that the invention is not limited to these exemplary embodiments and that specified features can also expediently be combined and/or modified without departing from the scope of the present invention as defined in the appended claims.

Referring to FIG. 1, a code generation method 100 for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention is schematically shown. The term “code word” refers to a sequence of code symbols such as binary or quaternary code symbols. “Source code words” are used to provide pieces of information, e.g. binary encoded bitstreams, whereas “target code words” are modulated sequences of code symbols used to carry the pieces of information in a transcoded format suitable for generating synthesized oligos from.

In a first step 101 a plurality of source code words and a plurality of target code words are provided. In another embodiment these initial pluralities of source and target code words may already be available.

In a second step 102 the plurality of target code words is grouped into a plurality of subsets of the target code words. The target code words comprise an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical. In other words, each target code word of a subset is identified by the same identifier wherein the identifier comprised in the identifying portion may be represented by a single or multiple code symbols being either consecutive or distributed across the code word.

In a third step 103 a first set of code symbols of the source code words is selected for addressing the plurality of subsets.

The first set of code symbols corresponds to an identifying portion of the source code words.

In a fourth step 104 for the subsets one or more corresponding neighboring subsets within the plurality of subsets are determined. The identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols, for example one code symbol.

In a fifth step 105 source code words whose corresponding first set of code symbols addresses the same subset are assigned to that subset, such that an amount of the target code words of the subset which have their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to an optimization criterion.

Additionally referring to FIG. 2, steps of a code generation method for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention are schematically shown. The shown steps refer to an example of an embodiment of the step 101 of providing source and target code words according to the method shown in FIG. 1. Here, the provision comprises target code words being preselected according to run length constraints and the pluralities of source and target code words being divided into a first and a second code.

In a step 201 the source code words and an initial plurality of target code words are provided.

In a next step 202 target code words are removed from the plurality of target code words, i.e. the initial plurality of target code words, according to a decoding related criterion before the plurality of target code words is grouped into a plurality of subsets. According to the decoding related criterion, target code words that comprise a run of identical code symbols longer than a predefined maximum run length are removed. In an example embodiment, this predefined maximum run length is set to three code symbols. In other embodiments, the predefined maximum run length can be set to, for example, two, four, or other values.

In a next step 203 target code words that comprise a run of identical code symbols longer than the predefined maximum run length when being concatenated with another of the target code words are removed. This eliminates all target code words that fail to meet the run length constraint, either alone or in combination with another of the target code words. Therefore, sequences of multiple target code words will meet the run length constraint.

In another step 204 both the source code words and target code words are divided into a first and a second code suitable to avoid self-reverse complementary code word sequences. The pluralities of source code words and target code words are divided into source code words and target code words of a first code and of a second code both having the properties that the reverse complementary code word of a target code word of the corresponding code still belongs to the corresponding code and that there is no common code word between the first code and the second code. The steps of grouping 102, selecting 103, determining 104 and assigning 105 as shown in FIG. 1 are carried out for the first code. In another embodiment these steps can be applied, additionally or instead, to the second code.

Referring to FIG. 3, a code generation method 300 for mapping a plurality of source code words to a plurality of target code words according to another embodiment of the invention is schematically shown. Without loss of generality, target code words consist of quaternary code symbols A, T, C, G corresponding to DNA nucleobases and are represented by integers 0, 1, 2, 3, respectively, whereas source code words consist of binary code symbols represented by integers 0 and 1.

In the shown embodiment, in a first step 301 all quaternary target code words, i.e. all quaternary symbol tuples, of a predefined length L are generated. The term “tuple” is used to refer to an ordered list of elements, such as a sequence of code symbols.

In a second step 302 all symbol tuples violating the (d,k) run length constraint by themselves or by cascading two symbol tuples are eliminated from the set of target code words, i.e. from the generated quaternary symbol tuples. The run length limitation is set by lower limit d and upper limit k, for example with parameters d=1 and k=3. With these example parameters, run lengths will be limited to the range from 1 to 3 after modulation. Any modulation fulfilling the run length constraints has a code rate less than the capacity of about 1.9975 bits/symbol. As an example, the modulation code is generated by mapping bit tuples, i.e. binary source code words, of length 9 to (target) symbol tuples, i.e. target code words, of length 5. Other lengths may be chosen instead. For the chosen example parameters, the corresponding code rate is 9/5=1.8 bits/symbol.

For illustration of the shown embodiment, bit tuples of source code words before modulation are denoted as (u₁,u₂,u₃,u₄,u₅,u₆,u₇,u₈,u₉) and quaternary symbol tuples of target code words after modulation are denoted as (x₁,x₂,x₃,x₄,x₅), where uᵢ∈{0,1}, 1≤i≤9 and xᵢ∈{0,1,2,3}, 1≤i≤5.

For example, for the above-mentioned chosen parameters, steps 301 and 302 are performed as follows to fulfill the run length constraints: According to step 301, all 1024 quaternary target symbol tuples of length 5 from (0,0,0,0,0) to (3,3,3,3,3) are constructed. According to step 302, target symbol tuples obtained in step 301 which begin or end with two identical symbols are eliminated. In other words, only target symbol tuples with x₁≠x₂ and x₄≠x₅ are maintained, so that concatenating two target symbol tuples still fulfills the run length constraints d=1, k=3.
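By way of a non-limiting illustration, steps 301 and 302 for the example parameters may be sketched as follows. The helper name `max_run` and the variable names are assumptions made for this sketch only; the symbols 0 to 3 stand for A, T, C, G as stated above.

```python
from itertools import product

def max_run(symbols):
    """Return the length of the longest run of identical consecutive symbols."""
    longest = current = 1
    for prev, cur in zip(symbols, symbols[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest

# Step 301: all quaternary symbol tuples of length L = 5.
all_tuples = list(product(range(4), repeat=5))

# Step 302: keep only tuples with x1 != x2 and x4 != x5.
kept = [x for x in all_tuples if x[0] != x[1] and x[3] != x[4]]

# Longest run found in any concatenation of two kept tuples.
worst = max(max_run(a + b) for a in kept for b in kept)

print(len(all_tuples), len(kept), worst)  # prints: 1024 576 3
```

The count of 576 remaining tuples agrees with the size of the set of target code words stated for the example, and no concatenation of two remaining tuples exceeds the upper limit k=3.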

In a next step 303, if necessary, target symbol tuples not having reverse complementary counterparts are eliminated. The remaining reverse complementary pairs of target symbol tuples, i.e. target code words, are denoted as code C.

With the chosen example parameters as described above, the resulting set of target symbol tuples automatically only contains target symbol tuples with reverse complementary counterparts.

In a next step 304 individual reverse complementary pairs of target code symbol tuples, i.e. target code words, are found in C which fulfill self-reverse complementary constraints (i) and (ii) below as a prerequisite for enabling generation of non-self-reverse complementary target code word sequences. The resulting set has 576 target code words (length-5 target symbol tuples belonging to C). These target symbol tuples, i.e. target code words, exhibit at least the following properties: (i) they are not self-reverse complementary; and (ii) the reverse complementary counterpart of any code word is also a code word within C. In other words, if X∈C and X̄ denote a code word and its reverse complementary counterpart, respectively, then X≠X̄ and X̄∈C. In this context, X and X̄ are called a reverse complementary pair, so that code C is composed of reverse complementary pairs. In the example, there are 288 reverse complementary pairs in C.
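Properties (i) and (ii) can be checked with a short sketch such as the following. It assumes the usual base pairing A↔T and C↔G, i.e. 0↔1 and 2↔3 in the integer representation used here; the names `COMPLEMENT` and `reverse_complement` are chosen for illustration only.

```python
from itertools import product

# Assumed base pairing: A<->T (0<->1) and C<->G (2<->3).
COMPLEMENT = {0: 1, 1: 0, 2: 3, 3: 2}

def reverse_complement(word):
    """Reverse the word and replace every symbol by its complement."""
    return tuple(COMPLEMENT[s] for s in reversed(word))

# Code C: length-5 tuples with x1 != x2 and x4 != x5.
C = {x for x in product(range(4), repeat=5) if x[0] != x[1] and x[3] != x[4]}

# Property (ii): C is closed under taking the reverse complement.
closed = all(reverse_complement(x) in C for x in C)

# Property (i): no code word equals its own reverse complement.
self_rc = [x for x in C if reverse_complement(x) == x]

# Each unordered pair {X, reverse_complement(X)} is counted once.
pairs = {frozenset((x, reverse_complement(x))) for x in C}

print(closed, len(self_rc), len(pairs))  # prints: True 0 288
```

For an odd word length such as 5, no word can be self-reverse complementary, because the middle symbol would have to equal its own complement.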

In a next step 305 combinations of reverse complementary pairs of target code words fulfilling the self-reverse complementary constraints are found and selected as code C₂. Remaining reverse complementary pairs are denoted as code C₁. In more detail, to avoid self-reverse complementarity of target code word sequences, code C is divided into two subsets, denoted as code C₁ and C₂. DNA fragments to be synthesized are composed by multiplexing code words from C₁ and C₂.

As an example, the construction of code C₁ and C₂ according to step 305 can be performed in two phases:

In a first phase, a reverse complementary pair comprised in code C, denoted as Y=(y₁,y₂,y₃,y₄,y₅) and Ȳ=(ȳ₁,ȳ₂,ȳ₃,ȳ₄,ȳ₅), is selected to construct code C₂, and the remaining 287 reverse complementary pairs are selected to belong to C₁. For example, Y=(0,2,0,1,0) and Ȳ=(1,0,1,3,1) may be selected to construct C₂. Other selections are possible. Only one code word of the reverse complementary pair from C₂, for example Y, is used to be multiplexed with code words from C₁ to generate sequences suitable for synthesizing DNA fragments, while all code words from C₁ can be chosen for multiplexing. Using only Y (while Ȳ is not used) for multiplexing ensures that for a DNA fragment its reverse complementary counterpart only includes Ȳ. Otherwise, a DNA fragment cannot be guaranteed not to be self-reverse complementary for some arrangements.

For explanation purposes, a counterexample is given using both Y and Ȳ for multiplexing. A sequence potentially suitable for a DNA fragment is constructed as S=[X₁,X₂,Y,X₃,X̄₃,Ȳ,X̄₂,X̄₁], and its reverse complementary counterpart is given by S̄=[X₁,X₂,Y,X₃,X̄₃,Ȳ,X̄₂,X̄₁]=S, indicating that the originally constructed DNA fragment is self-reverse complementary. On the other hand, if only Y is used for S, Y will never appear in S̄.
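The counterexample can be reproduced with the following sketch. The code words X₁, X₂, X₃ are arbitrary members of C picked for illustration, the base pairing 0↔1, 2↔3 is assumed, and `rc` is a hypothetical helper name.

```python
# Assumed base pairing: A<->T (0<->1) and C<->G (2<->3).
COMPLEMENT = {0: 1, 1: 0, 2: 3, 3: 2}

def rc(seq):
    """Reverse complement of a symbol sequence."""
    return tuple(COMPLEMENT[s] for s in reversed(seq))

# Arbitrary code words from C (x1 != x2 and x4 != x5), chosen for illustration.
X1, X2, X3 = (0, 1, 0, 1, 0), (2, 3, 0, 1, 2), (3, 0, 2, 1, 3)
Y = (0, 2, 0, 1, 0)

# Using both Y and its counterpart rc(Y) permits a self-reverse complementary sequence.
S = X1 + X2 + Y + X3 + rc(X3) + rc(Y) + rc(X2) + rc(X1)
print(rc(S) == S)  # prints: True

# Using only Y at both positions breaks the symmetry for this arrangement.
S_only_Y = X1 + X2 + Y + X3 + rc(X3) + Y + rc(X2) + rc(X1)
print(rc(S_only_Y) == S_only_Y)  # prints: False
```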

For example, (x₁,x₂,x₃,x₄,x₅) and (x₁′,x₂′,x₃′,x₄′,x₅′) denote two target code words from C₁. These code words can but do not have to be different. Since C₁ and C₂ are disjoint, i.e. there is no common code word belonging to C₁ and also to C₂, no target code word of C₁ is equal to Y. If neither any combination of two code words from C₁ nor any combination of one code word from C₁ and Ȳ is equal to Y, any DNA fragment including Y will not be self-reverse complementary.

Specifically, it is checked whether Y is not equal to any one of the following combinations: (x₅,x₁′,x₂′,x₃′,x₄′), (x₄,x₅,x₁′,x₂′,x₃′), (x₃,x₄,x₅,x₁′,x₂′), (x₂,x₃,x₄,x₅,x₁′), (x₅,ȳ₁,ȳ₂,ȳ₃,ȳ₄), (x₄,x₅,ȳ₁,ȳ₂,ȳ₃), (x₃,x₄,x₅,ȳ₁,ȳ₂), (x₂,x₃,x₄,x₅,ȳ₁), (ȳ₂,ȳ₃,ȳ₄,ȳ₅,x₁), (ȳ₃,ȳ₄,ȳ₅,x₁,x₂), (ȳ₄,ȳ₅,x₁,x₂,x₃), (ȳ₅,x₁,x₂,x₃,x₄). If Y is not equal to any of these combinations and a code word sequence includes Y, Y will not appear at any position of the reverse complementary counterpart of the code word sequence, making it suitable for synthesizing a corresponding DNA fragment.

For the concrete example as described above, a total of 18 reverse complementary pairs can be used to construct C₂:

{(0,1,1,1,0), (1,0,0,0,1)}, {(2,1,1,1,0), (1,0,0,0,3)}, {(3,1,1,1,0), (1,0,0,0,2)}, {(0,2,2,2,0), (1,3,3,3,1)}, {(1,2,2,2,0), (1,3,3,3,0)}, {(3,2,2,2,0), (1,3,3,3,2)}, {(0,3,3,3,0), (1,2,2,2,1)}, {(2,3,3,3,0), (1,2,2,2,3)}, {(2,0,0,0,1), (0,1,1,1,3)}, {(3,0,0,0,1), (0,1,1,1,2)}, {(0,2,2,2,1), (0,3,3,3,1)}, {(3,2,2,2,1), (0,3,3,3,2)}, {(2,3,3,3,1), (0,2,2,2,3)}, {(2,0,0,0,2), (3,1,1,1,3)}, {(3,0,0,0,2), (3,1,1,1,2)}, {(2,1,1,1,2), (3,0,0,0,3)}, {(2,3,3,3,2), (3,2,2,2,3)}, {(2,0,0,0,3), (2,1,1,1,3)}.

The pairs can be found by checking all possible divisions of C into C₁ and C₂.
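One possible way to carry out this search for the first phase is sketched below. For every candidate Y, code C₁ is taken as C without Y and Ȳ, and Y is compared against every 5-symbol window straddling two code words from C₁, or one code word from C₁ and Ȳ. The function name `passes_check` and the prefix/suffix shortcut are assumptions of this sketch, not a prescribed implementation.

```python
from itertools import product

# Assumed base pairing: A<->T (0<->1) and C<->G (2<->3).
COMPLEMENT = {0: 1, 1: 0, 2: 3, 3: 2}

def rc(word):
    return tuple(COMPLEMENT[s] for s in reversed(word))

C = [x for x in product(range(4), repeat=5) if x[0] != x[1] and x[3] != x[4]]

def passes_check(Y):
    """True if Y equals no window straddling two C1 words or a C1 word and rc(Y)."""
    Yb = rc(Y)
    C1 = [x for x in C if x != Y and x != Yb]
    for k in range(1, 5):          # k symbols from the left word, 5 - k from the right
        head, tail = Y[:k], Y[k:]
        left_c1 = any(x[5 - k:] == head for x in C1)    # a C1 word ends with head
        right_c1 = any(x[:5 - k] == tail for x in C1)   # a C1 word starts with tail
        left_yb = Yb[5 - k:] == head
        right_yb = Yb[:5 - k] == tail
        if (left_c1 and right_c1) or (left_c1 and right_yb) or (left_yb and right_c1):
            return False
    return True

candidate_pairs = {frozenset((Y, rc(Y))) for Y in C if passes_check(Y)}
print(len(candidate_pairs))  # prints: 18
```

In this sketch every surviving code word has three identical middle symbols (x₂=x₃=x₄), which matches the 18 pairs listed above: such a run cannot be reproduced across a code word boundary, because no code word begins or ends with two identical symbols.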

If only one pair is used to construct C₂, and therefore only one code word Y is used to construct sequences for synthesizing DNA fragments by multiplexing code words from C₁ and Y, code C₂ is used solely to avoid self-reverse complementarity. If more reverse complementary pairs are used to construct C₂, code C₂ can also be used to encode and transmit information in addition to avoiding self-reverse complementarity.

Continuing the example given above, in a second phase, 16 reverse complementary pairs are selected from the obtained 18 pairs to construct C₂, and the remaining 272 reverse complementary pairs are selected to construct C₁. As a result of the first phase, no code word in C₂ is equal to any combination of two code words from C₁. For example, (x₁,x₂,x₃,x₄,x₅) and Y=(y₁,y₂,y₃,y₄,y₅) are code words from C₁ and C₂, respectively. It is checked whether Y is not equal to any one of the following combinations:

(x₅,y₁′,y₂′,y₃′,y₄′), (x₄,x₅,y₁′,y₂′,y₃′), (x₃,x₄,x₅,y₁′,y₂′), (x₂,x₃,x₄,x₅,y₁′), (y₂′,y₃′,y₄′,y₅′,x₁), (y₃′,y₄′,y₅′,x₁,x₂), (y₄′,y₅′,x₁,x₂,x₃), (y₅′,x₁,x₂,x₃,x₄), where Y′=(y₁′,y₂′,y₃′,y₄′,y₅′) denotes a code word from C₂.

Again, only one code word from each reverse complementary pair in C₂ is used to be multiplexed to construct sequences for generating DNA fragments, while there is no limitation to choose code words from C₁ for multiplexing.

If all 32 code words (corresponding to 16 code word pairs) from C₂ pass the check, they can be used to store 4 bits of information, as only one code word from each reverse complementary pair in C₂ is used to be multiplexed, in addition to being used to avoid self-reverse complementarity in conjunction with code words from C₁.

If not all 32 code words pass the check, 8 reverse complementary pairs are used to construct C₂, and the above check is carried out again. If in this case the check is passed, 3 bits of information can be stored using C₂. This procedure is continued until a set of reverse complementary pairs is found that passes the check.

Continuing the example above, any combination of 16 reverse complementary pairs from the 18 pairs passes the check. Therefore, any combination of 16 reverse complementary pairs can be used as C₂. Without loss of generality, the following 16 reverse complementary pairs are used:

{(0,1,1,1,0), (1,0,0,0,1)}, {(2,1,1,1,0), (1,0,0,0,3)}, {(3,1,1,1,0), (1,0,0,0,2)}, {(0,2,2,2,0), (1,3,3,3,1)}, {(1,2,2,2,0), (1,3,3,3,0)}, {(3,2,2,2,0), (1,3,3,3,2)}, {(0,3,3,3,0), (1,2,2,2,1)}, {(2,3,3,3,0), (1,2,2,2,3)}, {(2,0,0,0,1), (0,1,1,1,3)}, {(3,0,0,0,1), (0,1,1,1,2)}, {(0,2,2,2,1), (0,3,3,3,1)}, {(3,2,2,2,1), (0,3,3,3,2)}, {(2,3,3,3,1), (0,2,2,2,3)}, {(2,0,0,0,2), (3,1,1,1,3)}, {(3,0,0,0,2), (3,1,1,1,2)}, {(2,1,1,1,2), (3,0,0,0,3)}.

Consequently, C₂ can be used to store 4 bits per code word, and there are 544 code words in C₁, enabling storage of 9 bits per code word. If one code word from C₂ is inserted after every nₛ code words from C₁, every 5(nₛ+1) quaternary symbols can store 4+9nₛ information bits, i.e., the code rate is calculated by

$\frac{4 + {9n_{s}}}{5\left( {n_{s} + 1} \right)}.$

For example, for nₛ=10, the code rate is 94/55, i.e. about 1.709 bits/symbol.
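The code rate expression can be evaluated directly, for instance as in the following sketch, in which the function name `code_rate` is an assumption made for illustration.

```python
from fractions import Fraction

def code_rate(n_s):
    """Bits per quaternary symbol when one C2 word follows every n_s C1 words."""
    return Fraction(4 + 9 * n_s, 5 * (n_s + 1))

for n_s in (1, 10, 100):
    print(n_s, code_rate(n_s), round(float(code_rate(n_s)), 3))
```

For nₛ=10 this evaluates to 94/55≈1.709 bits/symbol. The rate increases with nₛ and stays below the 9/5=1.8 bits/symbol of code C₁ alone.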

Still referring to FIG. 3, in a next step 306 assignments between bit tuples, i.e. source code words, and symbol tuples, i.e. target code words, for code C₂ are found, which minimize a bit error rate after demodulation.

Continuing the example given above, one code word from each reverse complementary pair in C₂ is used to store 4 bits of information. Without loss of generality, the following target code words can be selected:

{(0,1,1,1,0), (2,1,1,1,0), (3,1,1,1,0), (0,2,2,2,0), (1,2,2,2,0), (1,3,3,3,2), (0,3,3,3,0), (2,3,3,3,0), (2,0,0,0,1), (3,0,0,0,1), (0,2,2,2,1), (3,2,2,2,1), (2,3,3,3,1), (2,0,0,0,2), (3,0,0,0,2), (2,1,1,1,2)}.

One common property of these code words is that for each fixed middle 3-symbol tuple there are 4 code words, and there are four different middle 3-symbol tuples. Therefore, the above target code words can be divided into four subsets according to the middle 3-symbol tuple. Two information bits can be mapped to the middle 3-symbol tuple, and the other two information bits can be assigned dependent on the begin/end symbols of the code words.

For example, for (u₁,u₂,u₃,u₄) being an information tuple, i.e. a source code word, to be mapped to a code word in C₂, the first two bits can be mapped to the middle 3-symbols according to Table A:

TABLE A
u₁, u₂    x₂, x₃, x₄
0, 0      0, 0, 0
0, 1      1, 1, 1
1, 0      2, 2, 2
1, 1      3, 3, 3

For demodulation, i.e. the decision from a sequenced 3-symbol tuple to a 2-bit tuple, the Hamming distance, i.e. the number of differing symbols between two symbol tuples, can be used as a decision criterion. The symbol tuple in the above lookup table with the minimum Hamming distance to the sequenced symbol tuple is selected. Therefore, a single symbol error, which causes one synthesized symbol to be sequenced as a different symbol than the correct one, does not cause any bit error. For example, a bit tuple 0,0 is modulated to a symbol tuple 0,0,0, which will be used for synthesizing. If one symbol error occurs, an incorrect symbol tuple, for example 1,0,0, will be sequenced. After calculating the Hamming distances to the symbol tuples in the lookup table, the symbol tuple 0,0,0 is nevertheless decided to be the correct one, resulting in no bit error.
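A minimal sketch of this minimum-distance decision, using the assignments of Table A, is given below. The names `TABLE_A`, `hamming` and `demodulate_middle` are hypothetical and serve illustration only.

```python
# Table A: (u1, u2) -> (x2, x3, x4)
TABLE_A = {
    (0, 0): (0, 0, 0),
    (0, 1): (1, 1, 1),
    (1, 0): (2, 2, 2),
    (1, 1): (3, 3, 3),
}

def hamming(a, b):
    """Number of positions in which two equally long tuples differ."""
    return sum(s != t for s, t in zip(a, b))

def demodulate_middle(sequenced):
    """Return the bit pair whose symbol tuple is closest to the sequenced tuple."""
    return min(TABLE_A, key=lambda bits: hamming(TABLE_A[bits], sequenced))

print(demodulate_middle((1, 0, 0)))  # prints: (0, 0)
```

Since any two symbol tuples of Table A differ in all three positions, a tuple with one erroneous symbol remains at distance 1 from the correct entry and at distance 2 or 3 from all others.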

Further, u₃,u₄ are mapped to target symbols x₁,x₅ such as to minimize the bit error rate. For example, for the case x₂,x₃,x₄=0,0,0, where (x₁,x₅)∈{(2,1),(3,1),(2,2),(3,2)}, there are a total of 4·3·2·1=24 possible mappings from u₃,u₄ to x₁,x₅. Due to single symbol errors a two-symbol tuple (x₁,x₅)∈{(2,1),(3,1),(2,2),(3,2)} may be changed to another tuple (x₁′,x₅′)∈{(2,1),(3,1),(2,2),(3,2)}. For example, (x₁,x₅)=(2,1) is changed to (x₁′,x₅′)=(3,1), denoted as (x₁,x₅)→(x₁′,x₅′). By listing all cases of such single symbol errors, the total number of resulting bit errors can be evaluated as

$J = \sum_{(x_1,x_5)\rightarrow(x_1',x_5')} d_H\left((u_3,u_4),(u_3',u_4')\right)$  (eq. 1)

where d_H((u₃,u₄),(u₃′,u₄′)) denotes the Hamming distance between (u₃,u₄) and (u₃′,u₄′), and (u₃,u₄) and (u₃′,u₄′) are the bit tuples mapped to (x₁,x₅) and (x₁′,x₅′), respectively, for a specific mapping. All 24 possible mappings are tested according to the cost function given in (eq. 1), and a mapping minimizing (eq. 1) is selected as an appropriate mapping between (u₃,u₄) and (x₁,x₅). One such mapping is shown in Table B:

TABLE B
u₃, u₄    x₁, x₅
0, 0      2, 1
0, 1      3, 1
1, 0      2, 2
1, 1      3, 2
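The exhaustive test of all 24 mappings against (eq. 1) may be sketched as follows. The sketch counts every ordered single symbol transition between two tuples of the set, which is an assumption about how the sum in (eq. 1) is enumerated; the variable names are illustrative.

```python
from itertools import permutations, product

symbols = [(2, 1), (3, 1), (2, 2), (3, 2)]     # (x1, x5) for x2,x3,x4 = 0,0,0
bit_pairs = list(product((0, 1), repeat=2))    # candidate (u3, u4) values

def hamming(a, b):
    return sum(s != t for s, t in zip(a, b))

# Ordered transitions caused by exactly one symbol error that stay within the set.
transitions = [(s, t) for s in symbols for t in symbols if hamming(s, t) == 1]

def cost(mapping):
    """Cost J of (eq. 1) for a mapping from (x1, x5) to (u3, u4)."""
    return sum(hamming(mapping[s], mapping[t]) for s, t in transitions)

costs = [cost(dict(zip(symbols, perm))) for perm in permutations(bit_pairs)]

table_b = {(2, 1): (0, 0), (3, 1): (0, 1), (2, 2): (1, 0), (3, 2): (1, 1)}
print(len(costs), min(costs), cost(table_b))  # prints: 24 8 8
```

Under this counting, the mapping of Table B attains the minimum, since every single symbol transition changes exactly one bit.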

Consequently, the mapping between (u₁,u₂,u₃,u₄) and (x₁,x₂,x₃,x₄,x₅) for a fixed middle 3-symbol tuple x₂,x₃,x₄=0,0,0 is shown in Table C:

TABLE C
u₁, u₂, u₃, u₄    x₁, x₂, x₃, x₄, x₅
0, 0, 0, 0        2, 0, 0, 0, 1
0, 0, 0, 1        3, 0, 0, 0, 1
0, 0, 1, 0        2, 0, 0, 0, 2
0, 0, 1, 1        3, 0, 0, 0, 2

Mappings for other fixed middle 3-symbol patterns can be determined accordingly. In summary, for the given example the modulation table for C₂ is obtained as shown in Table D:

TABLE D
(u₁, u₂, u₃, u₄)    (x₁, x₂, x₃, x₄, x₅)
0, 0, 0, 0          2, 0, 0, 0, 1
0, 0, 0, 1          3, 0, 0, 0, 1
0, 0, 1, 0          2, 0, 0, 0, 2
0, 0, 1, 1          3, 0, 0, 0, 2
0, 1, 0, 0          0, 1, 1, 1, 0
0, 1, 0, 1          2, 1, 1, 1, 0
0, 1, 1, 0          3, 1, 1, 1, 0
0, 1, 1, 1          2, 1, 1, 1, 2
1, 0, 0, 0          0, 2, 2, 2, 0
1, 0, 0, 1          1, 2, 2, 2, 0
1, 0, 1, 0          0, 2, 2, 2, 1
1, 0, 1, 1          3, 2, 2, 2, 1
1, 1, 0, 0          1, 3, 3, 3, 2
1, 1, 0, 1          0, 3, 3, 3, 0
1, 1, 1, 0          2, 3, 3, 3, 0
1, 1, 1, 1          2, 3, 3, 3, 1

If symbol error probabilities are available, the cost function (eq. 1) can be modified as

$J_p = \sum_{(x_1,x_5)\rightarrow(x_1',x_5')} P\{(x_1,x_5)\rightarrow(x_1',x_5')\}\, d_H\left((u_3,u_4),(u_3',u_4')\right)$  (eq. 2)

As an example, P{(2,1)→(3,1)}=P{2→3} is the probability that a symbol 2 (corresponding to nucleotide C) is synthesized, but a symbol 3 (corresponding to nucleotide G) is sequenced. If such symbol error probabilities are available, an appropriate mapping can be found by minimizing the cost function (eq. 2).

Still referring to FIG. 3, in a next step 307 assignments between bit tuples, i.e. source code words, and symbol tuples, i.e. target code words, for code C₁ are found which minimize a bit error rate after demodulation. As mentioned before, according to the example given above there are 544 target code words in C₁. A mapping rule is determined to assign source code words (u₁,u₂,u₃,u₄,u₅,u₆,u₇,u₈,u₉) to target code words (x₁,x₂,x₃,x₄,x₅), such that the bit error rate after demodulation is minimized.

At first, the code word portion x₁,x₃,x₅ is considered. It can be verified that for each of the 64 different combinations of x₁,x₃,x₅, there are 8 or more code words in C₁. Therefore, 6 bits can be assigned to x₁,x₃,x₅. Without loss of generality, u₁,u₂ are mapped to x₁; u₃,u₄ are mapped to x₃; and u₅,u₆ are mapped to x₅. For example, one mapping can be defined as

$x_1 = u_1 + 2u_2,\quad x_3 = u_3 + 2u_4,\quad x_5 = u_5 + 2u_6.$  (eq. 3)
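As a brief illustration of (eq. 3), the identifying symbols can be computed from the first six bits of a source code word as sketched below; the function name `identifying_symbols` is hypothetical.

```python
def identifying_symbols(u):
    """Map bits u1..u6 of a 9-bit source code word to (x1, x3, x5) per (eq. 3)."""
    u1, u2, u3, u4, u5, u6 = u[:6]
    return (u1 + 2 * u2, u3 + 2 * u4, u5 + 2 * u6)

print(identifying_symbols((0, 1, 0, 0, 0, 0, 0, 0, 0)))  # prints: (2, 0, 0)
print(identifying_symbols((1, 1, 0, 0, 0, 0, 0, 0, 0)))  # prints: (3, 0, 0)
```

The second call corresponds to a source code word beginning with 1,1, which addresses the subset with x₁=3, x₃=x₅=0.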

As another example, if symbol error probabilities are available, a mapping other than (eq. 3) resulting in a lower bit error probability can be employed. In other words, the following cost function can be used to find an appropriate mapping:

$J_p = \sum_{x_1\rightarrow x_1'} P\{x_1\rightarrow x_1'\}\, d_H\left((u_1,u_2),(u_1',u_2')\right)$  (eq. 4)

Similar cost functions can be used for mapping between u₃,u₄ and x₃, and between u₅,u₆ and x₅.

According to x₁,x₃,x₅, C₁ is divided into 64 subsets, denoted as S₁,S₂, . . . ,S₆₄, where the index i of Sᵢ is given by i=x₁+4x₃+16x₅+1. For example, S₁={(01010),(02010),(03010),(01020),(02020),(03020),(01030),(02030),(03030)}, where x₁=x₃=x₅=0.

The assignment of information bits, i.e. source code words, to symbols, i.e. target code words, while minimizing the bit error probability, is carried out on a subset basis. In this context, the concept of neighboring subsets is used. Since each subset is indexed by x₁,x₃,x₅ as the identifying portion of the target code word, a neighboring subset is obtained by flipping a predefined amount of symbols, for example one symbol, of x₁,x₃,x₅. In the shown embodiment the number of neighboring subsets for a specific subset is limited, as only dominant symbol errors are taken into account for the flipping. As an example, the dominant single symbol errors for synthesizing and sequencing DNA fragments are the symbol transitions between A and G, or between C and T, or equivalently, between 0 and 3 or between 2 and 1. Therefore, in the described example each subset has exactly three neighboring subsets. For example, S₁ has x₁=0,x₃=0,x₅=0, and its neighboring subsets have x₁=3,x₃=0,x₅=0, or x₁=0,x₃=3,x₅=0, or x₁=0,x₃=0,x₅=3. Hence, the neighboring subsets of S₁ are S₄,S₁₃,S₄₉.
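The subset indexing and the determination of neighboring subsets under the dominant symbol transitions may be sketched as follows; the names `DOMINANT_FLIP`, `subset_index` and `neighbors` are assumptions of this sketch.

```python
# Dominant single symbol errors: A<->G (0<->3) and T<->C (1<->2).
DOMINANT_FLIP = {0: 3, 3: 0, 1: 2, 2: 1}

def subset_index(x1, x3, x5):
    """Index i of subset S_i for the identifying portion (x1, x3, x5)."""
    return x1 + 4 * x3 + 16 * x5 + 1

def neighbors(x1, x3, x5):
    """Indices of the subsets reached by one dominant flip of x1, x3 or x5."""
    ident = (x1, x3, x5)
    result = []
    for pos in range(3):
        flipped = list(ident)
        flipped[pos] = DOMINANT_FLIP[ident[pos]]
        result.append(subset_index(*flipped))
    return result

print(subset_index(0, 0, 0), neighbors(0, 0, 0))  # prints: 1 [4, 13, 49]
```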

Additionally referring to FIG. 4, an example of a neighboring subset graph is schematically shown. The neighboring subset graph is obtained by connecting neighboring subsets, where the numbers on the branches between two neighboring subsets denote the number of common x₂,x₄ combinations between them.

As an example, S₁₃={(01310),(02310),(03310),(01320),(02320),(03320),(01330),(02330)}, which has 8 common x₂,x₄ combinations with S₁, namely {11,21,31,12,22,32,13,23}. For the assignments of the 3 bits u₇,u₈,u₉ to x₂,x₄, the number of common assignments between two neighboring subsets is maximized, so that the influence of dominant single symbol errors in x₁,x₃,x₅ on u₇,u₈,u₉ is minimized. In other words, if the assignments of the 3 bits u₇,u₈,u₉ to x₂,x₄ are given for S₁, the same assignments will be applied for S₁₃.

On the other hand, S₄ has only 6 common assignments with S₁. Therefore, 2 further assignments are needed for S₄, which can be found similarly according to (eq. 1) or (eq. 2) to minimize the bit error probability after demodulation.
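The numbers of common x₂,x₄ combinations can be reproduced with the following sketch. It assumes that C₂ consists of the 16 reverse complementary pairs selected in the example above and that the base pairing is 0↔1, 2↔3; the names `C2_HALF` and `remaining` are illustrative.

```python
from itertools import product

COMPLEMENT = {0: 1, 1: 0, 2: 3, 3: 2}

def rc(word):
    return tuple(COMPLEMENT[s] for s in reversed(word))

# First code word of each of the 16 pairs selected for C2 in the example.
C2_HALF = [
    (0, 1, 1, 1, 0), (2, 1, 1, 1, 0), (3, 1, 1, 1, 0), (0, 2, 2, 2, 0),
    (1, 2, 2, 2, 0), (3, 2, 2, 2, 0), (0, 3, 3, 3, 0), (2, 3, 3, 3, 0),
    (2, 0, 0, 0, 1), (3, 0, 0, 0, 1), (0, 2, 2, 2, 1), (3, 2, 2, 2, 1),
    (2, 3, 3, 3, 1), (2, 0, 0, 0, 2), (3, 0, 0, 0, 2), (2, 1, 1, 1, 2),
]
C2 = set(C2_HALF) | {rc(x) for x in C2_HALF}
C1 = [x for x in product(range(4), repeat=5)
      if x[0] != x[1] and x[3] != x[4] and x not in C2]

def remaining(x1, x3, x5):
    """Set of (x2, x4) combinations of the subset identified by (x1, x3, x5)."""
    return {(x[1], x[3]) for x in C1 if (x[0], x[2], x[4]) == (x1, x3, x5)}

S1, S4, S13 = remaining(0, 0, 0), remaining(3, 0, 0), remaining(0, 3, 0)
print(len(C1), len(S1), len(S13), len(S1 & S13), len(S1 & S4))
# prints: 544 9 8 8 6
```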

Continuing the example above, a mapping for S₁ is determined in order to assign three bits to two symbols in the set {11,21,31,12,22,32,13,23}. A first example of a mapping is given in Table E:

TABLE E
u₇, u₈, u₉    x₂, x₄
000           11
100           21
010           31
110           12
001           22
101           32
011           13
111           23

As an example, it is assumed that due to single symbol errors a code word in the subset is changed to another code word in the subset. For example, 11 is modified to be 21, and bit tuple 000 will be decided as 100 during demodulation, causing 1 bit error. By listing all cases of such single symbol errors, the total number of resulting bit errors can be evaluated as

$J = \sum_{(x_2,x_4)\rightarrow(x_2',x_4')} d_H\left((u_7,u_8,u_9),(u_7',u_8',u_9')\right)$  (eq. 5)

where (x₂′,x₄′) is caused by a single symbol error applied to (x₂,x₄), and both (x₂,x₄) and (x₂′,x₄′) are combinations within S₁. For the above example, J=51. In total, there are 8·7·6·5·4·3·2·1=40320 possible mappings between u₇,u₈,u₉ and x₂,x₄. All mappings are tested by evaluating the cost function (eq. 5). A mapping resulting in the minimal J value is selected as an appropriate mapping. One such mapping is shown in Table F:

TABLE F
u₇, u₈, u₉    x₂, x₄
000           22
100           21
010           32
110           31
001           23
101           13
011           12
111           11

Here, the corresponding cost function results in J=36. Consequently, the mapping rule for S₁ is found as

TABLE G
(u₁, u₂, u₃, u₄, u₅, u₆, u₇, u₈, u₉)    (x₁, x₂, x₃, x₄, x₅)
000000000                               02020
000000100                               02010
000000010                               03020
000000110                               03010
000000001                               02030
000000101                               01030
000000011                               01020
000000111                               01010
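The exhaustive search over all 40320 mappings according to (eq. 5) may be sketched as follows. As an assumption of this sketch, every ordered transition between two x₂,x₄ combinations of the set that differ in exactly one symbol is counted once; the variable names are illustrative.

```python
from itertools import permutations, product

# (x2, x4) combinations to which the three bits u7, u8, u9 are assigned.
combos = [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2), (1, 3), (2, 3)]
bit_triples = list(product((0, 1), repeat=3))

def hamming(a, b):
    return sum(s != t for s, t in zip(a, b))

# Ordered single symbol transitions that stay within the set.
transitions = [(s, t) for s in combos for t in combos if hamming(s, t) == 1]

def cost(mapping):
    """Cost J of (eq. 5) for a mapping from (x2, x4) to (u7, u8, u9)."""
    return sum(hamming(mapping[s], mapping[t]) for s, t in transitions)

# Table F, written as (x2, x4) -> (u7, u8, u9).
table_f = {
    (2, 2): (0, 0, 0), (2, 1): (1, 0, 0), (3, 2): (0, 1, 0), (3, 1): (1, 1, 0),
    (2, 3): (0, 0, 1), (1, 3): (1, 0, 1), (1, 2): (0, 1, 1), (1, 1): (1, 1, 1),
}

best = min(cost(dict(zip(combos, perm))) for perm in permutations(bit_triples))
print(cost(table_f), best)  # prints: 36 36
```

Under this counting, the mapping of Table F yields J=36, which is also the minimum found by the search.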

If symbol error probabilities are available, a cost function employing such error probabilities can be used to find an appropriate mapping:

$J_p = \sum_{(x_2,x_4)\rightarrow(x_2',x_4')} P\{(x_2,x_4)\rightarrow(x_2',x_4')\}\, d_H\left((u_7,u_8,u_9),(u_7',u_8',u_9')\right)$  (eq. 6)

Referring to the neighboring subset graph shown in FIG. 4, the mapping rule between u₇,u₈,u₉ and x₂,x₄ is also suitable for S₁₃. For S₄, common assignments between S₄ and its neighbors are checked. Six assignments are fixed to those of S₁, as shown in Table H:

TABLE H
u₇, u₈, u₉    x₂, x₄
000           22
100           21
001           23
101           13
011           12
111           11

Furthermore, there are 6 common assignments between S₄ and S₅₂ for x₂,x₄∈{22,21,12,11,01,02} and 9 common assignments between S₄ and S₁₆ for x₂,x₄∈{22,21,23,13,12,11,01,02,03}. Therefore, to maximize the common assignments between neighboring subsets, x₂,x₄∈{22,21,23,13,12,11,01,02} is used to assign 3 bits to them.

Since 6 assignments between u₇,u₈,u₉ and x₂,x₄ are already fixed, an additional, suitable mapping between u₇,u₈,u₉∈{010,110} and x₂,x₄∈{01,02} is determined. Again, cost function (eq. 5) or (eq. 6) can be used to find an appropriate mapping, while the 6 assignments between u₇,u₈,u₉ and x₂,x₄ remain fixed. By employing (eq. 5), a mapping rule is found for S₄, for which x₁=3,x₃=x₅=0, as shown in Table I:

TABLE I
(u₁, u₂, u₃, u₄, u₅, u₆, u₇, u₈, u₉)    (x₁, x₂, x₃, x₄, x₅)
110000000                               32020
110000100                               32010
110000010                               30020
110000110                               30010
110000001                               32030
110000101                               31030
110000011                               31020
110000111                               31010

Similar procedures can be carried out for the other subsets in the neighboring subset graph of FIG. 4. Applying the same procedure for all subsets, a modulation table for code C₁ can be obtained as shown in TABLE J below. TABLE J comprises 256 lines and 4 columns, wherein the first and the third column show binary source code words (u₁,u₂, . . . ,u₈,u₉) and the second and the fourth column show quaternary target code words (x₁,x₂,x₃,x₄,x₅) assigned to the source code words in the same line of the first and the third column, respectively, resulting in a code book containing 512 source code word/target code word mappings.

Hence, a method for generating a modulation code with high efficiency is provided that limits run lengths of modulation sequences, avoids self-reverse complementarity, and minimizes the bit error rate after demodulation.

Referring to FIG. 5, a code generating apparatus 500 for mapping a plurality of source code words to a plurality of target code words according to an embodiment of the invention is schematically illustrated. The shown apparatus allows implementing the advantages and characteristics of the described code generation method as part of an apparatus for mapping a plurality of source code words to a plurality of target code words.

The apparatus 500 has a first input 501 for receiving target code words and a second input 502 for receiving source code words. In another embodiment, both inputs can be implemented as a single input or interface. The code words are received from a memory device or a processing device arranged to generate the code words. In an embodiment the memory device or processing device can be comprised in the apparatus 500.

The apparatus 500 comprises a code word grouping unit 503 configured to group the plurality of target code words received through the first input 501 into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical.

The apparatus 500 further comprises a selection unit 504 connected to the code word grouping unit 503 and configured to select a first set of code symbols of the source code words for addressing the plurality of subsets. The source code words are received through the second input 502.

Further, a determining unit 505 is connected to the code word grouping unit 503. It is configured to determine for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols.

Further, the apparatus 500 comprises a mapping unit 506 connected at least to the selection unit 504 and the determining unit 505. It is configured to assign source code words where the corresponding first set of code symbols addresses the same subset, to said subset such that an amount of the target code words of said subset having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to an optimization criterion.

The generated target code words can be output and stored, for example, in a memory. In the embodiment shown in FIG. 5, the mapping unit 506 is connected to a code word sequence generating unit 507 which is configured to generate at least one code word sequence from one or more of the target code words. The code word sequences are provided to a synthesizer unit 508 configured to synthesize at least one nucleic acid molecule comprising a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence. In the embodiment shown in FIG. 5, the illustrated apparatus 500 comprises the synthesizer unit 508 connected to receive the generated code word sequences. It is configured to synthesize nucleic acid molecules, for example DNA or RNA strands, each containing a segment wherein a sequence of nucleotides is arranged to correspond to a particular code word sequence. In another embodiment, the apparatus does not comprise the synthesizer unit but is connected or connectable to it by means of an interface.

In an embodiment, the apparatus 500 is a device being part of another apparatus or system, such as a storage system, e.g. a DNA storage system or RNA storage system.

The apparatus 500 may, for example, be programmable logic circuitry or a processing device arranged to generate the code, connected to or comprising at least one memory device for storing the code.

The code word grouping unit 503, the selection unit 504, the determining unit 505 and the mapping unit 506, and also the code word sequence generating unit 507 may, for example, be provided as separate devices, jointly as at least one device or logic circuitry, or as functionality carried out by a microprocessor, microcontroller or other processing device, computer or other programmable apparatus.

As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as an apparatus, a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of a hardware embodiment, a software embodiment or an embodiment combining software and hardware aspects. Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) may be utilized.

Aspects of the invention may, for example, at least partly be implemented in a computer program comprising code portions for performing steps of the method according to an embodiment of the invention when run on a programmable apparatus or enabling a programmable apparatus to perform functions of an apparatus or system according to an embodiment of the invention.

Further, any shown connection may be a direct or an indirect connection. Furthermore, those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or impose an alternate decomposition of functionality upon various logic blocks.

TABLE J

Source code words (u₁, u₂, . . . , u₈, u₉) and target code words (x₁, x₂, x₃, x₄, x₅); four pairs per row.

u₁...u₉   x₁...x₅   u₁...u₉   x₁...x₅   u₁...u₉   x₁...x₅   u₁...u₉   x₁...x₅
000000000 02020     010000000 21020     001000000 02120     011000000 23130
000100000 01210     010100000 23230     001100000 02320     011100000 21320
000010000 02021     010010000 23001     001010000 03131     011010000 23131
000110000 03231     010110000 23231     001110000 02321     011110000 23301
000001000 02012     010001000 23002     001001000 03132     011001000 23132
000101000 03232     010101000 23232     001101000 02312     011101000 23302
000011000 02023     010011000 21023     001011000 02123     011011000 23103
000111000 01213     010111000 23203     001111000 02323     011111000 21323
000000100 02010     010000100 21010     001000100 01120     011000100 21130
000100100 01220     010100100 21230     001100100 02310     011100100 21310
000010100 02001     010010100 21001     001010100 02131     011010100 21131
000110100 02231     010110100 21231     001110100 02301     011110100 21301
000001100 02002     010001100 21002     001001100 02132     011001100 21132
000101100 02232     010101100 21232     001101100 02302     011101100 21302
000011100 02013     010011100 21013     001011100 01123     011011100 21113
000111100 01223     010111100 21213     001111100 02313     011111100 21313
000000010 03020     010000010 23020     001000010 02130     011000010 23110
000100010 02230     010100010 23210     001100010 03320     011100010 23320
000010010 03021     010010010 21031     001010010 03101     011010010 23101
000110010 03201     010110010 23201     001110010 03321     011110010 21331
000001010 03012     010001010 21032     001001010 03102     011001010 23102
000101010 03202     010101010 23202     001101010 03312     011101010 21332
000011010 03023     010011010 23023     001011010 02103     011011010 23113
000111010 02203     010111010 23213     001111010 03323     011111010 23323
000000110 03010     010000110 23010     001000110 01130     011000110 21120
000100110 01230     010100110 21220     001100110 03310     011100110 23310
000010110 03001     010010110 21021     001010110 02101     011010110 21101
000110110 02201     010110110 21201     001110110 03301     011110110 21321
000001110 03002     010001110 21012     001001110 02102     011001110 21102
000101110 02202     010101110 21202     001101110 03302     011101110 21312
000011110 03013     010011110 23013     001011110 01103     011011110 21123
000111110 01203     010111110 21223     001111110 03313     011111110 23313
000000001 02030     010000001 21030     001000001 02110     011000001 23120
000100001 02210     010100001 23220     001100001 02330     011100001 21330
000010001 02031     010010001 23031     001010001 03121     011010001 23121
000110001 03221     010110001 23221     001110001 02331     011110001 20301
000001001 02032     010001001 23032     001001001 02112     011001001 23112
000101001 02212     010101001 23212     001101001 02332     011101001 23332
000011001 02003     010011001 23003     001011001 02113     011011001 23123
000111001 02213     010111001 23223     001111001 02303     011111001 23303
000000101 01030     010000101 20030     001000101 03120     011000101 20130
000100101 03220     010100101 20230     001100101 01330     011100101 20330
000010101 01031     010010101 23021     001010101 01131     011010101 20131
000110101 01231     010110101 20231     001110101 01331     011110101 23321
000001101 01032     010001101 23012     001001101 01132     011001101 20132
000101101 01232     010101101 20232     001101101 01332     011101101 23312
000011101 01003     010011101 20003     001011101 03123     011011101 20103
000111101 03223     010111101 20203     001111101 01303     011111101 20303
000000011 01020     010000011 20020     001000011 03110     011000011 20120
000100011 03210     010100011 20220     001100011 01320     011100011 20320
000010011 01021     010010011 20031     001010011 01121     011010011 20121
000110011 01221     010110011 20221     001110011 01321     011110011 20331
000001011 01012     010001011 20032     001001011 03112     011001011 20112
000101011 03212     010101011 20212     001101011 01312     011101011 20332
000011011 01023     010011011 20023     001011011 03113     011011011 20123
000111011 03213     010111011 20223     001111011 01323     011111011 20323
000000111 01010     010000111 20010     001000111 03130     011000111 20110
000100111 03230     010100111 20210     001100111 01310     011100111 20310
000010111 01001     010010111 20021     001010111 01101     011010111 20101
000110111 01201     010110111 20201     001110111 01301     011110111 20321
000001111 01002     010001111 20012     001001111 01102     011001111 20102
000101111 01202     010101111 20202     001101111 01302     011101111 20312
000011111 01013     010011111 20013     001011111 03103     011011111 20113
000111111 03203     010111111 20213     001111111 01313     011111111 20313
100000000 12020     110000000 32020     101000000 13130     111000000 32120
100100000 13230     110100000 31210     101100000 12320     111100000 32320
100010000 13001     110010000 32021     101010000 13131     111010000 30131
100110000 13231     110110000 30231     101110000 13301     111110000 32321
100001000 13002     110001000 32012     101001000 13132     111001000 30132
100101000 13232     110101000 30232     101101000 13302     111101000 32312
100011000 12023     110011000 32023     101011000 13103     111011000 32123
100111000 13203     110111000 32223     101111000 12323     111111000 32323
100000100 12010     110000100 32010     101000100 12130     111000100 31120
100100100 12230     110100100 31220     101100100 12310     111100100 32310
100010100 12001     110010100 32001     101010100 12131     111010100 32131
100110100 12231     110110100 32231     101110100 12301     111110100 32301
100001100 12002     110001100 32002     101001100 12132     111001100 32132
100101100 12232     110101100 32232     101101100 12302     111101100 32302
100011100 12013     110011100 32013     101011100 12103     111011100 31123
100111100 12203     110111100 31223     101111100 12313     111111100 32313
100000010 13020     110000010 30020     101000010 13110     111000010 32130
100100010 13210     110100010 32230     101100010 13320     111100010 30320
100010010 12031     110010010 30021     101010010 13101     111010010 30101
100110010 13201     110110010 30201     101110010 12331     111110010 30321
100001010 12032     110001010 30012     101001010 13102     111001010 30102
100101010 13202     110101010 30202     101101010 12332     111101010 30312
100011010 13023     110011010 30023     101011010 13113     111011010 32103
100111010 13213     110111010 32203     101111010 13323     111111010 30323
100000110 13010     110000110 30010     101000110 12110     111000110 31130
100100110 12210     110100110 31230     101100110 13310     111100110 30310
100010110 12021     110010110 30031     101010110 12101     111010110 32101
100110110 12201     110110110 32201     101110110 12321     111110110 30331
100001110 12012     110001110 30032     101001110 12102     111001110 32102
100101110 12202     110101110 32202     101101110 12312     111101110 30332
100011110 13013     110011110 30013     101011110 12113     111011110 31103
100111110 12213     110111110 31203     101111110 13313     111111110 30313
100000001 12030     110000001 32030     101000001 13120     111000001 32110
100100001 13220     110100001 32210     101100001 12330     111100001 32330
100010001 13031     110010001 32031     101010001 13121     111010001 30121
100110001 13221     110110001 30221     101110001 10301     111110001 32331
100001001 13032     110001001 32032     101001001 13112     111001001 32112
100101001 13212     110101001 32212     101101001 10302     111101001 32332
100011001 13003     110011001 32003     101011001 13123     111011001 32113
100111001 13223     110111001 32213     101111001 13303     111111001 32303
100000101 10030     110000101 31030     101000101 10130     111000101 30120
100100101 10230     110100101 30220     101100101 10330     111100101 31330
100010101 13021     110010101 31031     101010101 10131     111010101 31131
100110101 10231     110110101 31231     101110101 13321     111110101 31331
100001101 13012     110001101 31032     101001101 10132     111001101 31132
100101101 10232     110101101 31232     101101101 13312     111101101 31332
100011101 12003     110011101 31003     101011101 10103     111011101 30123
100111101 10203     110111101 30223     101111101 12303     111111101 31303
100000011 10020     110000011 31020     101000011 10120     111000011 30110
100100011 10220     110100011 30210     101100011 10320     111100011 31320
100010011 10031     110010011 31021     101010011 10121     111010011 31121
100110011 10221     110110011 31221     101110011 10331     111110011 31321
100001011 10032     110001011 31012     101001011 10112     111001011 30112
100101011 10212     110101011 30212     101101011 10332     111101011 31312
100011011 10023     110011011 31023     101011011 10123     111011011 30113
100111011 10223     110111011 30213     101111011 10323     111111011 31323
100000111 10010     110000111 31010     101000111 10110     111000111 30130
100100111 10210     110100111 30230     101100111 10310     111100111 31310
100010111 10021     110010111 31001     101010111 10101     111010111 31101
100110111 10201     110110111 31201     101110111 10321     111110111 31301
100001111 10012     110001111 31002     101001111 10102     111001111 31102
100101111 10202     110101111 31202     101101111 10312     111101111 31302
100011111 10013     110011111 31013     101011111 10113     111011111 30103
100111111 10213     110111111 30203     101111111 10313     111111111 31313
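By way of a non-limiting illustration, a code book such as Table J may be held in memory as a lookup table. The following Python sketch uses only the first four source/target pairs of Table J; the names `TABLE_J_EXCERPT`, `INVERSE_EXCERPT`, `encode` and `decode` are hypothetical and do not appear in the disclosure, and the block-wise concatenation of target code words is an assumption made for the example.

```python
# Illustrative sketch only: a lookup over an excerpt of Table J.
# All names below are hypothetical and chosen for this example.

SOURCE_LEN = 9  # length of a source code word (u1 ... u9)
TARGET_LEN = 5  # length of a target code word (x1 ... x5)

# First four source/target pairs of Table J, copied verbatim.
TABLE_J_EXCERPT = {
    "000000000": "02020",
    "010000000": "21020",
    "001000000": "02120",
    "011000000": "23130",
}

# Inverse mapping for decoding; well defined because the target words differ.
INVERSE_EXCERPT = {target: source for source, target in TABLE_J_EXCERPT.items()}


def encode(bits: str) -> str:
    """Map a bit string, block by block, to concatenated target code words."""
    if len(bits) % SOURCE_LEN != 0:
        raise ValueError("input length must be a multiple of 9 bits")
    blocks = (bits[i:i + SOURCE_LEN] for i in range(0, len(bits), SOURCE_LEN))
    # A KeyError is raised for source words outside the four-row excerpt.
    return "".join(TABLE_J_EXCERPT[block] for block in blocks)


def decode(symbols: str) -> str:
    """Map concatenated target code words back to the source bit string."""
    if len(symbols) % TARGET_LEN != 0:
        raise ValueError("input length must be a multiple of 5 symbols")
    words = (symbols[i:i + TARGET_LEN] for i in range(0, len(symbols), TARGET_LEN))
    return "".join(INVERSE_EXCERPT[word] for word in words)


encoded = encode("000000000" + "011000000")
restored = decode(encoded)
```

A complete implementation would load all 512 pairs of Table J in the same manner; the excerpt is kept short here for readability.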

1. A computer-implemented code book generation method for mapping a plurality of source code words to a plurality of target code words, comprising
providing a plurality of source code words and a plurality of target code words;
grouping the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;
selecting a first set of code symbols of the source code words for addressing the plurality of subsets;
determining for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and
assigning source code words where the corresponding first set of code symbols is associated with the same subset, to target code words of said subset such that an amount of the target code words of said subset said source codewords are assigned to, having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to a criterion.
2. The method according to claim 1, comprising removing target code words from the plurality of target code words according to a decoding related criterion before grouping the plurality of target code words into a plurality of subsets of the target code words.
3. The method according to claim 2, wherein according to the decoding related criterion target code words that comprise a run length of identical code symbols of more than a predefined maximum run length are removed.
4. The method according to claim 3, wherein target code words that comprise a run length of identical code symbols of more than the predefined maximum run length when being concatenated with another of the target code words are removed.
5. The method according to claim 1, wherein said determining comprises that the identifying portions of the one or more neighboring subsets differ from the corresponding subset by selected symbol flips corresponding to dominant sequencing errors based on a sequencing error probability of nucleotides within nucleic acid strands.
6. The method according to claim 1, wherein the pluralities of source code words and target code words are divided into source code words and target code words of a first code and of a second code, the target code words of the first code and of the second code both having the properties that the reverse complementary word of a target code word of the corresponding code still belongs to the corresponding code, and that there is no common code word between the first code and the second code, and that a target code word of the second code is neither equal to any portion of two cascaded target code words of the first code nor equal to any portion of cascaded one target code word of the first code and one target code word of the second code, and wherein the grouping, selecting, determining and assigning is applied to the first code.
7. The method according to claim 6, wherein the second code is generated according to the following:
grouping the plurality of target code words of the second code into a plurality of subsets of the target code words of the second code, the target code words of the second code comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words of the second code corresponding to a same subset of the plurality of subsets of target code words of the second code are identical;
selecting a first set of code symbols of the source code words of the second code to be associated with the plurality of subsets of target code words of the second code;
assigning source code words of the second code where the corresponding first set of code symbols is associated with the same subset of target code words of the second code, to said subset according to a cost function minimizing a Hamming distance between the remaining portions of the target code words of the second code.
8. The method according to claim 7, wherein the cost function depends on a symbol error probability.
9. The method according to claim 8, wherein the symbol error probability is based on a sequencing error probability of nucleotides within nucleic acid strands.
10. The method according to claim 1, comprising generating at least one code word sequence from one or more of the target code words; and synthesizing at least one nucleic acid molecule comprising a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence.
11. A code generating apparatus for mapping a plurality of source code words to a plurality of target code words, comprising
a first input for receiving target code words and a second input for receiving source code words;
a code word grouping unit configured to group the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;
a selection unit connected to the code word grouping unit and configured to select a first set of code symbols of the source code words to be associated with the plurality of subsets;
a determining unit connected to the code word grouping unit and configured to determine for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and
a mapping unit connected to the selection unit and the determining unit and configured to assign source code words where the corresponding first set of code symbols is associated with the same subset, to target code words of said subset such that an amount of the target code words said subset said source code words are assigned to, having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to a criterion.
12. The apparatus according to claim 11, comprising a code word sequence generating unit configured to generate at least one code word sequence from one or more of the target code words; and a synthesizer unit configured to synthesize at least one nucleic acid molecule comprising a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence.
13. A computer readable storage medium having stored therein instructions enabling mapping a plurality of source code words to a plurality of target code words, which, when executed by a computer, cause the computer to:
provide a plurality of source code words and a plurality of target code words;
group the plurality of target code words into a plurality of subsets of the target code words, the target code words comprising an identifying portion and a remaining portion, wherein the identifying portions of the target code words corresponding to a same subset of the plurality of subsets are identical;
select a first set of code symbols of the source code words to be associated with the plurality of subsets;
determine for the subsets one or more corresponding neighboring subsets within the plurality of subsets, wherein the identifying portions of the target code words of the one or more neighboring subsets differ from the identifying portion of the target code words of the corresponding subset by up to a predetermined amount of code symbols; and
assign source code words where the corresponding first set of code symbols is associated with the same subset, to target code words of said subset said source code words are assigned to, having their remaining portions identical to the corresponding remaining portions of the target code words of their neighboring subsets corresponds to a criterion.
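For illustration only, the removal of target code words by run length (claims 2 and 3), the grouping into subsets and the determination of neighboring subsets (claim 1) may be sketched in Python as follows. The sketch rests on assumptions that the claims do not fix: the identifying portion is taken to be a prefix of the target code word, the difference between identifying portions is measured as a Hamming distance, and a toy quaternary alphabet with three-symbol words is used. All function and variable names are hypothetical. The assigning step is not shown, since the criterion it relies on is not specified here.

```python
# Illustrative sketch only; every name and parameter value is an assumption.
from collections import defaultdict
from itertools import groupby, product


def max_run(word: str) -> int:
    """Length of the longest run of identical symbols in a non-empty word."""
    return max(len(list(group)) for _, group in groupby(word))


def remove_long_runs(words, max_run_length):
    """Drop target code words whose longest run exceeds max_run_length."""
    return [w for w in words if max_run(w) <= max_run_length]


def group_by_identifier(words, id_len):
    """Group words into subsets keyed by their identifying portion.

    Assumption: the identifying portion is the first id_len symbols and
    the remaining portion is the rest of the word.
    """
    subsets = defaultdict(list)
    for w in words:
        subsets[w[:id_len]].append(w[id_len:])
    return dict(subsets)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def neighboring_subsets(subsets, max_diff):
    """For each subset, list the other subsets whose identifying portion
    differs in at most max_diff symbols."""
    return {
        ident: sorted(
            other for other in subsets
            if other != ident and hamming(ident, other) <= max_diff
        )
        for ident in subsets
    }


# Toy example: all three-symbol words over a quaternary alphabet.
candidates = ["".join(p) for p in product("0123", repeat=3)]
kept = remove_long_runs(candidates, max_run_length=2)
subsets = group_by_identifier(kept, id_len=1)
neighbors = neighboring_subsets(subsets, max_diff=1)
```

In this toy example only the four words consisting of a single repeated symbol are removed, so each subset retains fifteen remaining portions and, with one-symbol identifying portions, every other subset is a neighbor.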